Layer-Wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition
Abstract
The variety and complexity of accents pose a huge challenge to robust Automatic Speech Recognition (ASR). Some previous work has attempted to address such problems; however, most of the current approaches either require prior knowledge about the target accent, or cannot handle unseen accents and accent-unspecific standard speech. In this work, we aim to improve multi-accent speech recognition in the end-to-end (E2E) framework with a novel layer-wise adaptation architecture. Firstly, we propose a deep accent representation learning architecture to obtain an accurate accent embedding, and some advanced schemes are designed to further boost the quality of the accent embeddings, including the phone posteriorgram (PPG) feature, TTS based data augmentation in the training stage, and test-time augmentation and multi-embedding fusion in the testing stage. Then, layer-wise adaptation with the accent embeddings is developed for fast accent adaptation in ASR, and two types of adapter layers are designed: the gated adapter layer and the multi-basis adapter layer. Compared to the usual two-pass adaptation, these adapter layers are injected between the ASR encoder layers to encode the accent information flexibly, and perform fast adaptation on the corresponding accent. The experiments on the Accent AESRC corpus show that the proposed deep accent representation learning can capture accurate accent knowledge, and get high performance on accent classification. The new layer-wise adaptation architecture with the accurate accent embedding outperforms other traditional methods, and obtains a consistent $\sim$15% relative word error rate (WER) reduction on all kinds of testing scenarios, including seen accents, ...
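As a rough illustration of what an adapter layer injected between encoder layers might look like, here is a minimal NumPy sketch of a gated-style adapter. The abstract does not give the formulation, so everything here is an assumption made for illustration: the function name `gated_adapter`, the projection matrices `w_scale` and `w_shift`, the `tanh` gating, and the residual scale-and-shift form are hypothetical and are not taken from the paper.

```python
import numpy as np


def gated_adapter(hidden, accent_emb, w_scale, w_shift):
    """Hypothetical gated adapter: scale and shift encoder states by accent.

    hidden:     (frames, d_model) output of one encoder layer
    accent_emb: (d_accent,) utterance-level accent embedding
    w_scale:    (d_accent, d_model) projection for the gate (assumed learned)
    w_shift:    (d_accent, d_model) projection for the bias (assumed learned)
    """
    scale = np.tanh(accent_emb @ w_scale)  # per-dimension gate, shape (d_model,)
    shift = np.tanh(accent_emb @ w_shift)  # per-dimension bias, shape (d_model,)
    # Residual form: a zero accent embedding leaves the hidden states unchanged.
    return hidden + hidden * scale + shift


# Toy usage with random values standing in for trained parameters.
rng = np.random.default_rng(0)
frames, d_model, d_accent = 5, 8, 4
hidden = rng.standard_normal((frames, d_model))
accent_emb = rng.standard_normal(d_accent)
w_scale = rng.standard_normal((d_accent, d_model))
w_shift = rng.standard_normal((d_accent, d_model))

adapted = gated_adapter(hidden, accent_emb, w_scale, w_shift)
print(adapted.shape)
```

In this sketch the same scale and shift are broadcast over all frames, so the adaptation costs one extra pass over each adapted layer's output rather than a second decoding pass. Whether the paper's layers use this exact form is not stated in the abstract.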
Related resources
End-to-End Speech Recognition with Auditory Attention for Multi-Microphone Distance Speech Recognition
End-to-End speech recognition is a recently proposed approach that directly transcribes input speech to text using a single model. End-to-End speech recognition methods including Connectionist Temporal Classification and Attention-based Encoder Decoder Networks have been shown to obtain state-of-the-art performance on a number of tasks and significantly simplify the modeling, training and decodi...
End-to-end Audiovisual Speech Recognition
Several end-to-end deep learning approaches have been recently presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the ...
End-to-End Speech Recognition Models
For the past few decades, the bane of Automatic Speech Recognition (ASR) systems have been phonemes and Hidden Markov Models (HMMs). HMMs assume conditional independence between observations, and the reliance on explicit phonetic representations requires expensive handcrafted pronunciation dictionaries. Learning is often via detached proxy problems, and there especially exists a disconnect betw...
Multichannel End-to-end Speech Recognition
The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend ...
Towards End-to-End Speech Recognition
Standard automatic speech recognition (ASR) systems follow a divide and conquer approach to convert speech into text. Alternately, the end goal is achieved by a combination of sub-tasks, namely, feature extraction, acoustic modeling and sequence decoding, which are optimized in an independent manner. More recently, in the machine learning community deep learning approaches have emerged which al...
Journal
Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Year: 2022
ISSN: 2329-9304, 2329-9290
DOI: https://doi.org/10.1109/taslp.2022.3198546